Convergence Rate of Frank-Wolfe for Non-Convex Objectives
We give a simple proof that the Frank-Wolfe algorithm obtains a stationary
point at a rate of $O(1/\sqrt{t})$ on non-convex objectives with a Lipschitz
continuous gradient. Our analysis is affine invariant and is the first, to the
best of our knowledge, giving a similar rate to what was already proven for
projected gradient methods (though on slightly different measures of
stationarity).
Comment: 6 pages
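As an illustration of the setting described in this abstract, here is a minimal sketch of Frank-Wolfe on a smooth non-convex objective, using the Frank-Wolfe gap as the stationarity measure. The objective, the box constraint [-1, 1]^n, the constant C = L * diam^2 and the step size min(gap / C, 1) are all assumptions chosen for this example, not details taken from the paper.

```python
import math

# Hypothetical smooth non-convex test objective (an assumption of this sketch):
# f(x) = sum_i cos(3 x_i) + 0.5 (x_i - 0.3)^2, whose second derivative lies in
# [-8, 10], so the gradient is Lipschitz with constant L = 10.
L_SMOOTH = 10.0


def f(x):
    return sum(math.cos(3.0 * xi) + 0.5 * (xi - 0.3) ** 2 for xi in x)


def grad(x):
    return [-3.0 * math.sin(3.0 * xi) + (xi - 0.3) for xi in x]


def lmo_box(g):
    """Linear minimization oracle over the box [-1, 1]^n: argmin_s <s, g>."""
    return [-1.0 if gi > 0 else 1.0 for gi in g]


def frank_wolfe(x0, steps):
    """Run Frank-Wolfe and record the FW gap and objective at each iterate."""
    n = len(x0)
    # L * diam^2, where the Euclidean diameter of [-1, 1]^n is 2 * sqrt(n).
    C = L_SMOOTH * 4.0 * n
    x = list(x0)
    gaps, fvals = [], [f(x)]
    for _ in range(steps):
        g = grad(x)
        s = lmo_box(g)
        # FW gap: max over the feasible set of <x - s, grad f(x)>.
        # It is zero exactly at stationary points of the constrained problem.
        gap = sum((xi - si) * gi for xi, si, gi in zip(x, s, g))
        gaps.append(gap)
        gamma = min(max(gap, 0.0) / C, 1.0)
        x = [xi + gamma * (si - xi) for xi, si in zip(x, s)]
        fvals.append(f(x))
    return x, gaps, fvals
```

With this step size each iteration decreases the objective by at least gap^2 / (2C), so summing over T iterations bounds the smallest gap seen so far by a quantity of order 1/sqrt(T), which is the kind of rate the abstract refers to.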
PAC-Bayesian Theory Meets Bayesian Inference
We exhibit a strong link between frequentist PAC-Bayesian risk bounds and the
Bayesian marginal likelihood. That is, for the negative log-likelihood loss
function, we show that the minimization of PAC-Bayesian generalization risk
bounds maximizes the Bayesian marginal likelihood. This provides an alternative
explanation to the Bayesian Occam's razor criteria, under the assumption that
the data is generated by an i.i.d. distribution. Moreover, as the negative
log-likelihood is an unbounded loss function, we motivate and propose a
PAC-Bayesian theorem tailored for the sub-gamma loss family, and we show that
our approach is sound on classical Bayesian linear regression tasks.
Comment: Published at NIPS 2015
(http://papers.nips.cc/paper/6569-pac-bayesian-theory-meets-bayesian-inference)
Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization
Due to their simplicity and excellent performance, parallel asynchronous
variants of stochastic gradient descent have become popular methods to solve a
wide range of large-scale optimization problems on multi-core architectures.
Yet, despite their practical success, support for nonsmooth objectives is still
lacking, making them unsuitable for many problems of interest in machine
learning, such as the Lasso, group Lasso or empirical risk minimization with
convex constraints.
In this work, we propose and analyze ProxASAGA, a fully asynchronous sparse
method inspired by SAGA, a variance reduced incremental gradient algorithm. The
proposed method is easy to implement and significantly outperforms the state of
the art on several nonsmooth, large-scale problems. We prove that our method
achieves a theoretical linear speedup with respect to the sequential version
under assumptions on the sparsity of gradients and block-separability of the
proximal term. Empirical benchmarks on a multi-core architecture illustrate
practical speedups of up to 12x on a 20-core machine.
Comment: Appears in Advances in Neural Information Processing Systems 30 (NIPS
2017), 28 pages
SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives
In this work we introduce a new optimisation method called SAGA in the spirit
of SAG, SDCA, MISO and SVRG, a set of recently proposed incremental gradient
algorithms with fast linear convergence rates. SAGA improves on the theory
behind SAG and SVRG, with better theoretical convergence rates, and has support
for composite objectives where a proximal operator is used on the regulariser.
Unlike SDCA, SAGA supports non-strongly convex problems directly, and is
adaptive to any inherent strong convexity of the problem. We give experimental
results showing the effectiveness of our method.
Comment: Advances In Neural Information Processing Systems, Nov 2014,
Montreal, Canada
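To make the composite setting in this abstract concrete, the following is a minimal sketch of a SAGA-style update with a proximal step, applied to an L1-regularised least-squares problem. The function names, the tiny data layout (lists of lists), the step size 1/(3L) and the choice of the Lasso as the example are assumptions made for illustration; this is not the authors' reference implementation.

```python
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def soft_threshold(v, t):
    """Proximal operator of t * |.| (the L1 regulariser) applied to a scalar."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def objective(w, A, b, lam):
    """(1/n) * sum_i 0.5 * (<a_i, w> - b_i)^2 + lam * ||w||_1."""
    n = len(A)
    loss = sum(0.5 * (dot(a, w) - bi) ** 2 for a, bi in zip(A, b)) / n
    return loss + lam * sum(abs(wk) for wk in w)


def saga_lasso(A, b, lam, iters, seed=0):
    """SAGA-style sketch: variance-reduced gradient step followed by a prox."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    L = max(dot(a, a) for a in A)      # smoothness constant of each f_i
    gamma = 1.0 / (3.0 * L)            # step size assumed for this sketch
    w = [0.0] * d
    # For least squares, grad f_i(w) = r_i * a_i with residual r_i = <a_i, w> - b_i,
    # so storing one scalar per sample is enough to remember past gradients.
    table = [dot(a, w) - bi for a, bi in zip(A, b)]
    avg = [sum(table[i] * A[i][k] for i in range(n)) / n for k in range(d)]
    for _ in range(iters):
        j = rng.randrange(n)
        r_new = dot(A[j], w) - b[j]
        r_old = table[j]
        for k in range(d):
            # New gradient minus stored gradient plus average of stored gradients.
            g = (r_new - r_old) * A[j][k] + avg[k]
            w[k] = soft_threshold(w[k] - gamma * g, gamma * lam)
        # Refresh the running average and the stored gradient for sample j.
        for k in range(d):
            avg[k] += (r_new - r_old) * A[j][k] / n
        table[j] = r_new
    return w
```

A possible call is `saga_lasso([[1, 0], [0, 1], [1, 1], [1, -1]], [1, 0, 1, 1], lam=0.1, iters=2000)`; comparing `objective` at the returned point and at the zero vector shows the decrease.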
Frank-Wolfe Algorithms for Saddle Point Problems
We extend the Frank-Wolfe (FW) optimization algorithm to solve constrained
smooth convex-concave saddle point (SP) problems. Remarkably, the method only
requires access to linear minimization oracles. Leveraging recent advances in
FW optimization, we provide the first proof of convergence of a FW-type saddle
point solver over polytopes, thereby partially answering a 30-year-old
conjecture. We also survey other convergence results and highlight gaps in the
theoretical underpinnings of FW-style algorithms. Motivating applications
without known efficient alternatives are explored through structured prediction
with combinatorial penalties as well as games over matching polytopes involving
an exponential number of constraints.
Comment: Appears in: Proceedings of the 20th International Conference on
Artificial Intelligence and Statistics (AISTATS 2017). 39 pages
Rethinking LDA: moment matching for discrete ICA
We consider moment matching techniques for estimation in Latent Dirichlet
Allocation (LDA). By drawing explicit links between LDA and discrete versions
of independent component analysis (ICA), we first derive a new set of
cumulant-based tensors, with an improved sample complexity. Moreover, we reuse
standard ICA techniques such as joint diagonalization of tensors to improve
over existing methods based on the tensor power method. In an extensive set of
experiments on both synthetic and real datasets, we show that our new
combination of tensors and orthogonal joint diagonalization techniques
outperforms existing moment matching methods.
Comment: 30 pages; added plate diagrams and clarifications, changed style,
corrected typos, updated figures. In Proceedings of the 29th Conference on
Neural Information Processing Systems (NIPS), 2015
On the Equivalence between Herding and Conditional Gradient Algorithms
We show that the herding procedure of Welling (2009) takes exactly the form
of a standard convex optimization algorithm--namely a conditional gradient
algorithm minimizing a quadratic moment discrepancy. This link enables us to
invoke convergence results from convex optimization and to consider faster
alternatives for the task of approximating integrals in a reproducing kernel
Hilbert space. We study the behavior of the different variants through
numerical simulations. The experiments indicate that while we can improve over
herding on the task of approximating integrals, the original herding algorithm
tends to approach more often the maximum entropy distribution, shedding more
light on the learning bias behind herding.
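The equivalence stated in this abstract can be checked numerically on a toy problem. The sketch below runs herding and a conditional gradient method with step size 1/(t+1) on the objective 0.5 * ||g - mu||^2 and compares the points they select. The finite candidate set, the feature map, the shared first pick and the use of exact rational arithmetic (to avoid floating-point ties) are assumptions of this example rather than details from the paper.

```python
from fractions import Fraction


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def herding(feats, mu, T):
    """Herding: pick argmax_i <w, feats[i]>, then w <- w + mu - feats[i]."""
    d = len(mu)
    w = [Fraction(0)] * d
    picks = []
    for _ in range(T):
        i = max(range(len(feats)), key=lambda idx: dot(w, feats[idx]))
        picks.append(i)
        w = [wk + mk - fk for wk, mk, fk in zip(w, mu, feats[i])]
    return picks


def conditional_gradient(feats, mu, T, first):
    """Conditional gradient on 0.5 * ||g - mu||^2 with step 1 / (t + 1).

    The first atom is supplied by the caller so that both procedures start
    from the same point (herding's first pick is arbitrary when w = 0).
    """
    g = list(feats[first])
    picks = [first]
    for t in range(1, T):
        # The linear minimization step: argmin_i <g - mu, feats[i]>,
        # written as an argmax of <mu - g, feats[i]>.
        direction = [mk - gk for mk, gk in zip(mu, g)]
        i = max(range(len(feats)), key=lambda idx: dot(direction, feats[idx]))
        picks.append(i)
        rho = Fraction(1, t + 1)
        g = [(1 - rho) * gk + rho * fk for gk, fk in zip(g, feats[i])]
    return picks


def toy_problem(n=11):
    """Candidates x = k / (n - 1) with hypothetical features (x, x^2)."""
    xs = [Fraction(k, n - 1) for k in range(n)]
    feats = [[x, x * x] for x in xs]
    mu = [sum(f[k] for f in feats) / n for k in range(2)]
    return feats, mu
```

After t picks the herding weight vector equals t * (mu - g_t), where g_t is the running average of the selected features, so both methods maximise the same linear function up to a positive scaling and select the same sequence of points.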